By May 2020, more than 1 million confirmed COVID-19 cases and 60,000 deaths had been reported by the Johns Hopkins University Coronavirus Resource Center. Early detection of the factors most strongly associated with COVID-19 deaths at the U.S. county level can inform decisions on lifestyle changes for high-risk patients and on the distribution of public resources, and in turn help reduce the case fatality rate (CFR). This study aims to identify the health factors most relevant to COVID-19 deaths and to predict the overall risk using logistic regression.
The risk of dying from COVID-19 will be measured by the case fatality rate (CFR), defined as \[ \text{CFR} = \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}} \]
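As a small illustration of this definition, the sketch below computes the CFR for a single county. The function name `case_fatality_rate` and the example counts are hypothetical, and returning NaN for a county with no diagnosed cases is one possible convention, not something the study prescribes.

```python
def case_fatality_rate(deaths: int, cases: int) -> float:
    """Return deaths / cases, or NaN when there are no diagnosed cases."""
    if cases <= 0:
        # The ratio is undefined without diagnosed cases (assumed convention).
        return float("nan")
    return deaths / cases


# Made-up county: 12 deaths among 400 diagnosed cases.
example_cfr = case_fatality_rate(12, 400)
print(example_cfr)  # 0.03
```

Counties with very few cases give unstable ratios, so it may be worth inspecting the case counts alongside the CFR.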
We begin the investigation with descriptive summaries of the COVID-19 and county health data. Specifically, we are interested in visualizing the confirmed cases and deaths through graphs and plots. It is also helpful to gain a basic understanding of the demographic and health information for each county and state.
This project involves two datasets: one contains COVID-19 cases and deaths, and the other contains county-level health-related factors.
04-04-2020.csv.gz: The COVID-19 data contain confirmed cases and deaths as of 2020-04-04, retrieved from the Johns Hopkins COVID-19 data repository. It is available from this link (commit 0174f38).
us-county-health-rankings-2020.csv.gz: The 2020 County Health Rankings data were released by County Health Rankings. The data are available from the Kaggle Uncover COVID-19 Challenge (version 1).
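Both files are gzip-compressed CSVs, so they can be read without decompressing them by hand. The following is a minimal sketch using only the Python standard library; the helper name `read_gz_csv` and the tiny stand-in file (with made-up values and an assumed subset of column names) are for illustration only.

```python
import csv
import gzip
import os
import tempfile


def read_gz_csv(path):
    """Read a gzip-compressed CSV file into a list of dicts keyed by header."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Demo on a tiny stand-in file; the real inputs would be the two .csv.gz files.
sample = "FIPS,Province_State,Confirmed,Deaths\n36061,New York,100,5\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as handle:
        handle.write(sample)
    demo_rows = read_gz_csv(demo_path)

print(demo_rows[0]["Province_State"])  # New York
```

Every value comes back as a string, so counts still need to be converted to numbers before any arithmetic.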
Before any analysis, we should clean the data by removing records with missing or erroneous information.
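One way such a cleaning step could look is sketched below. It assumes the COVID-19 rows are dicts with the columns `Country_Region`, `Province_State`, `FIPS`, `Confirmed`, and `Deaths`; these names should be checked against the actual file header. The filtering rules (drop non-U.S. rows, rows without a FIPS code, and rows with implausible counts) are illustrative choices, not the study's prescribed procedure, and the sample rows are made up.

```python
def clean_covid_rows(rows):
    """Keep U.S. county rows with a FIPS code and plausible integer counts."""
    cleaned = []
    for row in rows:
        if row.get("Country_Region") != "US":
            continue  # only U.S. counties are of interest
        fips = (row.get("FIPS") or "").strip()
        if not fips:
            continue  # no county identifier to join on
        try:
            confirmed = int(row["Confirmed"])
            deaths = int(row["Deaths"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric counts
        if confirmed < 0 or deaths < 0 or deaths > confirmed:
            continue  # counts that cannot both be right (assumed rule)
        cleaned.append({
            # Pad to five digits in case leading zeros were dropped (assumption).
            "FIPS": fips.zfill(5),
            "Province_State": row.get("Province_State", ""),
            "Confirmed": confirmed,
            "Deaths": deaths,
        })
    return cleaned


# Made-up rows covering the cases handled above.
raw_rows = [
    {"FIPS": "1001", "Province_State": "Alabama", "Country_Region": "US",
     "Confirmed": "12", "Deaths": "0"},
    {"FIPS": "", "Province_State": "Elsewhere", "Country_Region": "Other",
     "Confirmed": "10", "Deaths": "1"},
    {"FIPS": "36061", "Province_State": "New York", "Country_Region": "US",
     "Confirmed": "", "Deaths": "5"},
    {"FIPS": "", "Province_State": "Utah", "Country_Region": "US",
     "Confirmed": "3", "Deaths": "0"},
]
cleaned_rows = clean_covid_rows(raw_rows)
print(len(cleaned_rows))  # 1
```

Counting how many rows each rule drops is a useful sanity check before proceeding.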
Graphically summarize the COVID-19 confirmed cases and deaths on 04/04/2020 by state.
Graphically summarize selected health status indicators by U.S. county.
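A state-level summary first requires aggregating the county rows. The sketch below sums confirmed cases and deaths per state and renders a plain-text bar chart as a stand-in for a real plot; in practice a plotting library would take the place of `text_bar_chart`. The function names, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def totals_by_state(rows):
    """Sum confirmed cases and deaths over all county rows of each state."""
    totals = defaultdict(lambda: {"Confirmed": 0, "Deaths": 0})
    for row in rows:
        state = row["Province_State"]
        totals[state]["Confirmed"] += int(row["Confirmed"])
        totals[state]["Deaths"] += int(row["Deaths"])
    return dict(totals)


def text_bar_chart(totals, key, width=40):
    """Return one line per state, largest first, with a bar scaled to `width`."""
    largest = max((counts[key] for counts in totals.values()), default=0)
    lines = []
    ordered = sorted(totals.items(), key=lambda item: item[1][key], reverse=True)
    for state, counts in ordered:
        bar_length = round(width * counts[key] / largest) if largest else 0
        lines.append(f"{state:<15}{counts[key]:>8} {'#' * bar_length}")
    return lines


# Made-up county rows for two placeholder states.
county_rows = [
    {"Province_State": "State A", "Confirmed": 100, "Deaths": 4},
    {"Province_State": "State A", "Confirmed": 50, "Deaths": 1},
    {"Province_State": "State B", "Confirmed": 30, "Deaths": 3},
]
state_totals = totals_by_state(county_rows)
chart_lines = text_bar_chart(state_totals, "Confirmed")
print("\n".join(chart_lines))
```

Passing `"Deaths"` as the key gives the matching chart for deaths, and dividing the two totals gives a state-level CFR.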